Route DebuggerController operations through a per-controller worker queue#1090
Conversation
d4973e0 to
3a44260
Compare
| // pair (which guarded against concurrent adapter access from multiple spawned threads) | ||
| // is no longer needed. The new Pause path bypasses this method entirely and calls | ||
| // m_adapter->BreakInto() out-of-band; everything else funnels through here on the worker. | ||
| assert(std::this_thread::get_id() == m_workerThreadId); |
There was a problem hiding this comment.
I'd recommend using BN_ASSERT or BN_RELEASE_ASSERT, depending on whether we want to validate this only in debug builds or also in our release builds. I'd lean towards the latter given that this check is cheap.
| DebuggerController::~DebuggerController() | ||
| { | ||
| { | ||
| std::lock_guard<std::mutex> lock(m_workQueueMutex); |
There was a problem hiding this comment.
I think there's an issue with the way mutexes and condition variables are used here that can lead to missed wakeups. Specifically, DebuggerController::WaitForAdapterStop uses m_adapterStopMutex with its condition variable, while this store occurs with m_workQueueMutex held. This means there's a race that can happen where m_workerShouldExit is set here but DebuggerController::WaitForAdapterStop does not see it.
This is related to the note at https://en.cppreference.com/cpp/thread/condition_variable that says:
Even if the shared variable is atomic, it must be modified while owning the mutex to correctly publish the modification to the waiting thread.
There was a problem hiding this comment.
I changed it to
{
std::scoped_lock shutdownLock(m_workQueueMutex, m_adapterStopMutex);
m_workerShouldExit = true;
}
which I believe should fix the issue
| // (GoAndWaitOnWorker etc.) is the sole notifier; it calls NotifyStopped on every | ||
| // genuine stop regardless of m_userRequestedBreak. m_userRequestedBreak only | ||
| // suppresses conditional-breakpoint auto-resume in ShouldSilentResumeAfterStop. | ||
| m_userRequestedBreak = true; |
There was a problem hiding this comment.
Is it safe to modify m_userRequestedBreak without taking any locks? It appears to potentially be read / written from different threads, so either needs to always be accessed with a lock held or to be std::atomic<bool>.
There was a problem hiding this comment.
this boolean is a leftover from the old code. It is no longer needed in the new design, I will delete it
|
|
||
| private: | ||
| // m_adapter is the active debug adapter for this controller. Written exclusively | ||
| // from the worker thread (in CreateDebugAdapter); read both from the worker |
There was a problem hiding this comment.
Since this field is accessed by multiple threads, don't accesses need to be guarded by a mutex to avoid being data races?
There was a problem hiding this comment.
yeah data races are not the focus of this PR, they will need a separate one to address
|
|
||
| void DebuggerController::WorkerThreadMain() | ||
| { | ||
| m_workerThreadId = std::this_thread::get_id(); |
There was a problem hiding this comment.
This is set on the worker thread but read from other threads without any synchronization, which is a data race. This needs some form of synchronization.
Another option would be to skip storing a thread ID, and instead use a thread-local variable that points to the DebuggerController instance when on the worker thread and is nullptr otherwise. To check whether code is running on the worker thread you'd then instead test t_controllerOnWorker == this. Since it is a thread-local variable there is no need for synchronization.
There was a problem hiding this comment.
I adopted the suggested method
| // caches updated via UpdateCaches) are mutated under the worker's serialization | ||
| // for ops the worker generates (TargetStopped, LaunchFailure, etc.) and under | ||
| // the adapter event thread for spontaneous events (TargetExited, RegisterChanged, | ||
| // ActiveThreadChanged). Concurrent mutation is unlikely in practice because the |
There was a problem hiding this comment.
I'm concerned by this reasoning. It seems to say that this code contains data races, but it's fine because they probably don't happen very often.
A data race is when two or more threads access the same memory location, with at least one of those accesses being a write, and there is no synchronization between the accesses (synchronization here is defined in terms of the happens-before relationship).
In C++, if a data race occurs, the behavior of the program is undefined. The compiler is free to assume that data races do not happen and optimize the code on that basis. This makes it impossible to reason about what the code is actually doing.
| // in-flight op to settle before returning). A timeout of milliseconds::max() means | ||
| // "wait forever" and bypasses wait_for entirely (avoids overflow inside the stdlib). | ||
| template<typename F> | ||
| auto SubmitAndWait(F&& f, std::chrono::milliseconds timeout) |
There was a problem hiding this comment.
What behavior does this have if run on the worker thread? Submit has an explicit case for it. Does this need one? If it's never expected to be called from the worker thread, an assert would serve to document that and catch any potential future misuse.
There was a problem hiding this comment.
After looking at the code closer, I have decided to make both of them NOT re-entrant. There is also no place where they are called in a reentrant way.
Just for the record -- Submit can be made re-entrant safe, and SubmitAndWait, I believe, cannot
| DebugStopReason RestartAndWaitOnWorker(); | ||
| void DetachAndWaitOnWorker(); | ||
| void QuitAndWaitOnWorker(); | ||
| DebugStopReason PauseAndWaitOnWorker(); |
There was a problem hiding this comment.
PauseAndWaitOnWorker appears to not be referenced anywhere.
|
Hi @bdash thanks for the review! I forgot to mention that this PR mainly means to be clean up the lifetime + control-flow issues, so that I can deal with the data races in a follow-up PR. I did a bad job in communicating this, and I guess you could not have known that because the last commit I added did appear to trying to handle the data races. I know that one is not sufficient yet -- I was just testing the general visibility of the idea, and data races will be addressed in a future PR I will go ahead and address the new issues you mention that are introduced by the new code, and have you look at it again |
| // Thread-safety: this body touches m_state's connection/execution status, plus | ||
| // m_lastIP / m_currentIP / m_exitCode on the controller -- all plain (non-atomic, | ||
| // unlocked) fields read from other threads (UI render layer, file accessor) without | ||
| // synchronization. Those are pre-existing data races reported in #1091; this PR |
There was a problem hiding this comment.
The remainder of this comment is in terms of "this PR" and "this commit", which won't make much sense once the change is merged.
I'd probably just end this with "Fixing these data races is tracked in #1091".
4d3a9ae to
4ff2922
Compare
Replaces the "spawn one detached
std::threadper operation" model inDebuggerControllerwith a single per-controller worker thread + FIFO queue, and makes the controller own its state directly instead of bouncing it through the event dispatcher.Implements #1088 and #1089. Supersedes the auto-closed #1087.
Why
The per-op detach pattern (17 sites) captured
thisand detached, so a controller could be destroyed while a worker was still running — the recurring use-after-free family in Sentry (#1083, BINARYNINJA-1A/3X/5Z/60/7M). The queue makes lifetime structural: the worker is owned by the controller and joined in the destructor, so no adapter work can outlive it.What changed
Submit/SubmitAndWait. One worker thread per controller, joined on destruction.Submitruns inline if called from the worker itself (soRestart→Quit+Launchdoesn't self-deadlock).XxxAndWaitgains an optionalstd::chrono::milliseconds timeout(default = wait forever). Existing callers unchanged.BreakInto()on the caller's thread, then queue the actual op. Without this they'd sit behind the in-flight resume forever.AdapterStoppedEventis now an internal worker signal (condvar), not a public dispatcher event. Conditional-breakpoint silent-resume moved onto the worker; the dispatcher's stop machinery,m_lastAdapterStopEventConsumed, andm_suppressResumeEventare gone. Adapters are unchanged.EventHandler(the controller subscribing to its own broadcast) is replaced byApplyOwnStateForEvent, called inline inPostDebuggerEventbefore the event is broadcast. The dispatcher is now pure fan-out to external consumers (UI/plugins). This removes the worker-vs-dispatcher race that broke Restart.m_adapterMutex/m_adapterMutex2guarded concurrent adapter access that can no longer happen; replaced with anassert(on worker thread).NotifyStoppedeven on a user-requested break (the old Pause path used to do it; the new out-of-band Pause doesn't), fixing "target halts but UI still shows running".API compatibility
Public method signatures unchanged (timeout is an appended optional arg). Callers in
core/ffi.cppandui/uinotification.cppneed no changes.Out of scope (follow-ups)
m_targetControlMutexis now largely redundant; kept to limit churn.DebugAdapterwas tried and reverted — the adapter has no independent lifecycle (freed only in~DebuggerState), so it solved no real UAF.Test plan
ninja— core + ui)sityped directly in the LLDB REPL while idle (spontaneous stop)test/debugger_test.py